Abstract

An extragrammatical sentence is one that a normal parser fails to
analyze. It is important to be able to recover such a sentence using
only syntactic information, although the results of recovery are
better if semantic factors are considered. A general algorithm for
least-errors recognition, which is based only on syntactic
information, was proposed by G. Lyon to deal with
extragrammaticality. We extended this algorithm to recover an
extragrammatical sentence into a grammatical one in running text.
Our robust parser with this recovery mechanism, the extended general
algorithm for least-errors recognition, can be easily scaled up and
modified because it utilizes only syntactic information. To upgrade
this robust parser we proposed heuristics through an analysis of the
Penn treebank corpus. The experimental result shows 68% ~ 77%
accuracy in error recovery.

1 Introduction 

Extragrammatical sentences include patently un- 
grammatical constructions as well as utterances that 
may be grammatically acceptable but are beyond 
the syntactic coverage of a parser, and any other 
difficult ones that are encountered in parsing (Car- 
bonell and Hayes, 1983). 

I am sure this is what he means. 
This is, I am sure, what he means. 

The progress of machine does not stop even a day. 
Not even a day does the progress of machine stop. 

The examples above show that people often write sentences with the
same meaning in different ways. In addition, people are prone to
mistakes when writing sentences. So, the bulk of written sentences
are open to extragrammaticality.

In the Penn treebank tree-tagged corpus (Marcus, 1991), for instance,
about 80 percent of the rules are concerned with peculiar sentences
which include inversive, elliptic, parenthetic, or emphatic phrases.
For example, we can derive a rule VP -> vb NP comma rb comma PP from
the following sentence.

The same jealousy can breed confusion, 
however, in the absence of any authoriza- 
tion bill this year. 

(
 (S
  (NP The/dt
   (ADJP same/jj) jealousy/nn) can/md
  (VP breed/vb
   (NP confusion/nn) ,/, however/rb ,/,
   (PP in/in
    (NP
     (NP the/dt absence/nn)
     (PP of/in
      (NP any/dt authorization/nn bill/nn))
     (NP this/dt year/nn) ) ) ) )
 ./.)

A robust parser is one that can analyze these extragrammatical
sentences without failure. However, if we try to preserve robustness
by adding such rules whenever we encounter an extragrammatical
sentence, the rulebase will grow rapidly, and processing and
maintaining the excessive number of rules will become inefficient
and impractical. Therefore, extragrammatical sentences should be
handled by some recovery mechanism(s) rather than by a set of
additional rules.

Many researchers have attempted several techniques to deal with
extragrammatical sentences, such as Augmented Transition Network
(ATN) (Kwasny and Sondheimer, 1981), network-based semantic grammar
(Hendrix, 1977), partial pattern matching (Hayes and Mouradian,
1981), conceptual case frame (Schank et al., 1980), and multiple
cooperating methods (Hayes and Carbonell, 1981). The above-mentioned
techniques take into account various semantic factors, depending on
the specific domains in question, in recovering extragrammatical
sentences. Although they can intrinsically provide even better
solutions, they are usually ad hoc and lack extensibility. Therefore,
it is important to recover extragrammatical sentences using only
syntactic factors, which are independent of any particular system
and any particular domain.

Mellish (Mellish, 1989) introduced some chart-based techniques using
only syntactic information for extragrammatical sentences. This
technique has the advantage that no work is repeated, because the
chart prevents the parser from generating an edge that already
exists. Also, because the recovery process runs only when a normal
parser terminates unsuccessfully, the performance of the normal
parser does not decrease when handling grammatical sentences.
However, his experiment was not based on the errors in running texts
but on artificial ones which were randomly generated by a human.
Moreover, only single-word errors were considered, though several
word errors can occur simultaneously in running text.

A general algorithm for least-errors recognition (Lyon, 1974),
proposed by G. Lyon, finds the least number of errors necessary for
successful parsing and recovers them. Because this algorithm is also
syntactically oriented and based on a chart, it has the same
advantages as Mellish's parser. When the original parsing algorithm
terminates unsuccessfully, the algorithm begins to assume errors of
insertion, deletion and mutation of a word. For any input, including
grammatical and extragrammatical sentences, this algorithm can
generate a resultant parse tree. At the cost of this complete
robustness, however, the algorithm degrades the efficiency of
parsing and generates many intermediate edges.

In this paper, we present a robust parser with a recovery mechanism.
We extend the general algorithm for least-errors recognition and
adopt it as the recovery mechanism in our robust parser. Because our
robust parser handles extragrammatical sentences with this recovery
mechanism, which is oriented to syntactic information, it can be
independent of a particular system or particular domain. Also, we
present heuristics to reduce the number of edges so that we can
upgrade the performance of our parser.

This paper is organized as follows: We first review a general
algorithm for least-errors recognition. Then we present the
extension of this algorithm, and the heuristics adopted by the
robust parser. Next, we describe the implementation of the system
and the result of the experiment of parsing real sentences. Finally,
we conclude with future directions.

2 Algorithm and Heuristics 

2.1 General algorithm for least-errors 
recognition 

The general algorithm for least-errors recognition (Lyon, 1974),
which is based on Earley's algorithm, assumes that sentences may
have insertion, deletion, and mutation errors of terminal symbols.
The objective of this algorithm is to parse the input string with
the least number of errors.

Figure 1: SCAN processing

A state used in this algorithm is a quadruple (p, j, f, e), where p
is a production number in the grammar, j marks a position in RHS(p),
f is the start position of the state in the input string, and e is
an error value.¹ A final state (p, p̄+1, f, e) denotes recognition of
a phrase RHS(p) with e errors, where p̄ is the number of components
in rule p. A stateset S(i), where i is the position in the input, is
an ordered set of states. States within a stateset are ordered by
ascending value of j, within a p, within an f; f takes descending
value.

When adding to statesets, if state (p, j, f, e) is a candidate for
admission to a stateset which already has a similar member
(p, j, f, e') and e' < e, then (p, j, f, e) is rejected. However, if
e' > e, then (p, j, f, e') is replaced by (p, j, f, e).
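The admission rule can be illustrated with a minimal Python sketch. It assumes a stateset is stored as a dictionary from (p, j, f) to the least error value seen so far; the name `admit` is hypothetical, the ordering of states within a stateset is not modelled, and the case e' = e, which the text leaves open, is treated here as a rejection.

```python
from typing import Dict, Tuple

# A state is (p, j, f, e); the key (p, j, f) identifies "similar" states.
Key = Tuple[int, int, int]


def admit(stateset: Dict[Key, float], p: int, j: int, f: int, e: float) -> bool:
    """Try to add state (p, j, f, e); return True if the stateset changed."""
    key = (p, j, f)
    old = stateset.get(key)
    if old is not None and old <= e:
        # A similar state with e' <= e exists (e' == e is our assumption).
        return False
    # Either no similar state exists, or e' > e and the old one is replaced.
    stateset[key] = e
    return True


S: Dict[Key, float] = {}
first = admit(S, 3, 1, 0, 2.0)    # admitted: no similar member yet
second = admit(S, 3, 1, 0, 5.0)   # rejected: e' = 2.0 < e = 5.0
third = admit(S, 3, 1, 0, 1.0)    # replaces: e' = 2.0 > e = 1.0
```

Keeping a single entry per (p, j, f) is what bounds the number of states even though every error hypothesis is tried.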

The algorithm works as follows : A procedure 
SCAN is carried out for each state in S(i). SCAN 
checks various correspondences of input token t(i) 
against terminal symbols in RHS of rules. Once 
SCAN is done, COMPLETER substitutes all final 
states of S(i) into all other analyses which can use 
them as components. 
SCAN 

SCAN handles states of S(i), checking each input terminal against
requirements of states in S(i) and various error hypotheses.
Figure 1 shows how SCAN proceeds.

Let c(p,j) be j-th component of RHS(p) and t(i) 
be i-th word of input string. 

• perfect match:

If c(p,j) = t(i) then add (p, j+1, f, e) to S(i+1) if possible.

• insertion-error hypothesis:

Add (p, j, f, e + $\alpha_{insertion}$) to S(i+1) if possible.
$\alpha_{insertion}$ is the cost of an insertion-error for a
terminal symbol.

• deletion-error hypothesis:

If c(p,j) is terminal, then add (p, j+1, f, e + $\alpha_{deletion}$)
to S(i) if possible.
$\alpha_{deletion}$ is the cost of a deletion-error for a terminal
symbol.

• mutation-error hypothesis:

If c(p,j) is terminal but not equal to t(i), then add
(p, j+1, f, e + $\alpha_{mutation}$) to S(i+1) if possible.
$\alpha_{mutation}$ is the cost of a mutation-error for a terminal
symbol.²

1 Lyon said that e is an error count.
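The four SCAN hypotheses can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' implementation: the data layout, the names `admit` and `scan`, the 0-based dot position j, and the worklist used to re-process states that a deletion hypothesis adds to S(i) are all assumptions. All three costs are set to 1, as in Lyon's formulation.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[int, int, int]      # (p, j, f); j is 0-based here, unlike the text
StateSet = Dict[Key, float]     # state key -> least error value e seen so far
Rule = Tuple[str, List[str]]    # (LHS, RHS)

# Unit costs, as in Lyon's original formulation.
A_INS = A_DEL = A_MUT = 1.0


def admit(s: StateSet, key: Key, e: float) -> bool:
    """Keep only the cheapest copy of a state; report whether s changed."""
    if key in s and s[key] <= e:
        return False
    s[key] = e
    return True


def scan(rules: List[Rule], terminals: Set[str],
         S: List[StateSet], i: int, t_i: str) -> None:
    """Apply the four SCAN hypotheses to every state in S[i] for token t_i."""
    work = list(S[i].items())
    while work:
        (p, j, f), e = work.pop()
        if S[i].get((p, j, f)) != e:
            continue                                # superseded by a cheaper copy
        rhs = rules[p][1]
        admit(S[i + 1], (p, j, f), e + A_INS)       # insertion: t_i is spurious
        if j < len(rhs) and rhs[j] in terminals:
            if rhs[j] == t_i:
                admit(S[i + 1], (p, j + 1, f), e)   # perfect match
            else:
                admit(S[i + 1], (p, j + 1, f), e + A_MUT)   # mutation
            # deletion: the expected terminal is missing, so stay in S[i]
            if admit(S[i], (p, j + 1, f), e + A_DEL):
                work.append(((p, j + 1, f), e + A_DEL))


# Toy grammar NP -> dt nn; the input token "nn" arrives with "dt" missing.
rules = [("NP", ["dt", "nn"])]
S: List[StateSet] = [{(0, 0, 0): 0.0}, {}]
scan(rules, {"dt", "nn"}, S, 0, "nn")
```

In this toy run the final state (0, 2, 0) reaches S[1] with error value 1.0: the noun phrase is recognized under the hypothesis that the determiner was deleted.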

COMPLETER 

COMPLETER handles substitution of final states in S(i) as in
Earley's original algorithm. Each final state means the recognition
of a nonterminal.



2.2 Extension of least-errors recognition algorithm



The algorithm in section 2.1 can analyze any input 
string with the least number of errors. But this algo- 
rithm can handle only the errors of terminal symbols 
because it doesn't consider the errors of nonterminal 
nodes. In the real text, however, the insertion, dele- 
tion, or inversion of a phrase - namely, nonterminal 
node - occurs more frequently. So, we extend the 
original algorithm in order to handle the errors of 
nonterminal symbols as well. 

In our extended algorithm, the same SCAN as that of the original
algorithm is used, while COMPLETER is modified and extended.
Figure 2 shows the processing of extended-COMPLETER. In figure 2,
[NP] denotes the final state whose rule has NP as its LHS. In other
words, it means the recognition of a noun phrase.
extended-COMPLETER 
If there is a final state s' = (p', p̄'+1, k, e') in S(i),

• phrase perfect match

If there exists a state s" = (p, j, x, e) in S(k), k < i and
c(p, j) = LHS(p') then add s = (p, j+1, x, e + e') into S(i).

• phrase insertion-error hypothesis³

If there exists a state s" = (p, j, x, e) in S(k) then add
s = (p, j, x, e + $\beta_{insertion}$) into S(i) if possible.
$\beta_{insertion}$ is the cost of an insertion-error for a
nonterminal symbol.

2 $\alpha_{insertion}$, $\alpha_{deletion}$, $\alpha_{mutation}$ are
all strictly 1 in Lyon's original paper.

3 In fact, there are cases in which an inserted phrase cannot be
constructed to form a nonterminal node. In the phrase insertion-error
hypothesis of figure 2, the original sentence is "Other countries,
including West Germany, may have ...", where the inserted phrase VP
is surrounded by commas. So, the substring ( comma VP comma ) should
be dealt with as a constituent in extended-COMPLETER. In fact, we
implemented the algorithm to allow substring insertions as well as
insertions of nonterminal nodes.



[Figure 2 panels, as far as recoverable:
 < Phrase Perfect Match >: s" = [ VP -> vb . NP PP ], s = [ VP -> vb NP . PP ]
 < Phrase Insertion-error Hypothesis >: s" = [ S -> NP . md VP ], s = [ S -> NP . md VP ], over the text "Other countries , including West Germany"
 < Phrase Deletion-error Hypothesis >: s" = [ S -> NP . VP PP ], s' = [PP], s = [ S -> NP VP . PP ]]

Figure 2: Examples of extended-COMPLETER processing



• phrase deletion-error hypothesis

If there exists a state s" = (p, j, x, e) in S(k) and c(p, j) is a
nonterminal then add s = (p, j+1, x, e + $\beta_{deletion}$) into
S(k) if possible.
$\beta_{deletion}$ is the cost of a deletion-error for a nonterminal
symbol.

• phrase mutation-error hypothesis⁴

If there exists a state s" = (p, j, x, e) in S(k) and c(p, j) is a
nonterminal but not equal to LHS(p') then add
s = (p, j+1, x, e + $\beta_{mutation}$) into S(i) if possible.
$\beta_{mutation}$ is the cost of a mutation-error for a nonterminal
symbol.

The extended least-errors recognition algorithm 
can handle not only terminal errors but also nonter- 
minal errors. 
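The extended-COMPLETER step for one final state can be sketched in Python as follows. This is a simplified reading of the hypotheses listed above, not the authors' code: the function names, the dictionary-based statesets, the 0-based dot position and the placeholder costs `B_INS` and `B_DEL` are assumptions, substring insertions (footnote 3) are not covered, and the phrase mutation-error hypothesis is left out, following footnote 4.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[int, int, int]      # (p, j, x); j is 0-based here
StateSet = Dict[Key, float]     # state key -> least error value seen so far
Rule = Tuple[str, List[str]]    # (LHS, RHS)

B_INS, B_DEL = 1.0, 1.0         # hypothetical placeholder costs


def admit(s: StateSet, key: Key, e: float) -> bool:
    """Keep only the cheapest copy of a state; report whether s changed."""
    if key in s and s[key] <= e:
        return False
    s[key] = e
    return True


def extended_complete(rules: List[Rule], nonterminals: Set[str],
                      S: List[StateSet], i: int,
                      p1: int, k: int, e1: float) -> None:
    """Process a final state of rule p1 that spans k..i with error e1."""
    lhs1 = rules[p1][0]
    for (p, j, x), e in list(S[k].items()):
        rhs = rules[p][1]
        if k < i and j < len(rhs) and rhs[j] == lhs1:
            admit(S[i], (p, j + 1, x), e + e1)      # phrase perfect match
        admit(S[i], (p, j, x), e + B_INS)           # phrase insertion-error
        if j < len(rhs) and rhs[j] in nonterminals:
            admit(S[k], (p, j + 1, x), e + B_DEL)   # phrase deletion-error


# State [VP -> vb . NP PP] waits in S[1]; an error-free NP spans 1..3.
rules = [("VP", ["vb", "NP", "PP"]), ("NP", ["dt", "nn"])]
S: List[StateSet] = [{}, {(0, 1, 0): 0.0}, {}, {}]
extended_complete(rules, {"VP", "NP", "PP"}, S, i=3, p1=1, k=1, e1=0.0)
```

The perfect match advances the dot over NP at no cost, while the two error hypotheses leave costlier alternatives in S[3] and S[1] that are only needed if the cheap analysis fails later.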

2.3 Heuristics 

The robust parser using the extended least-errors recognition
algorithm overgenerates error-hypothesis edges during the parsing
process. To cope with this problem, we adjust error values according
to the following heuristics. Edges with larger error values are
regarded as less important, so they are processed later than edges
with smaller error values.

4 We consider the phrase mutation-error hypothesis not to be
meaningful in real text, because we could not find any example of a
phrase mutation-error in the corpus. So we did not implement the
phrase mutation-error hypothesis.

• Heuristics 1: error types

The analysis of 3,538 sentences of the Penn treebank WSJ corpus
shows that there are 498 sentences with phrase deletions and 224
sentences with phrase insertions. So, we assign less error value to
the deletion-error hypothesis edge than to the insertion- and
mutation-error hypothesis edges.

$$\alpha < \beta$$
$$\alpha_{deletion} < \alpha_{insertion} < \alpha_{mutation}$$
$$\beta_{deletion} < \beta_{insertion}$$

where $\alpha$ is the error cost of a terminal symbol, and $\beta$
is the error cost of a nonterminal symbol.

• Heuristics 2: fiducial nonterminal

People often make mistakes in writing English. These mistakes
usually take place between small constituents, such as a verbal
phrase, an adverbial phrase and a noun phrase, rather than within
the small constituents themselves. The possibility of an error
occurring within a noun phrase is lower than between a noun phrase
and a verbal phrase, a prepositional phrase, or an adverbial phrase.
So, we treat some phrases, for example noun phrases, as fiducial
nonterminals, which means error-free nonterminals. When handling
sentences, the robust parser assigns more error value ($\delta_1$)
to an error-hypothesis edge occurring within a fiducial nonterminal.

• Heuristics 3: kinds of terminal symbols

Some terminal symbols like punctuation symbols, conjunctions and
particles are often misused. So, the robust parser assigns less
error value ($-\delta_2$) to the error-hypothesis edges with these
symbols than to those with other terminal symbols.

• Heuristics 4: inserted phrases between commas or parentheses

Most inserted phrases are surrounded by commas or parentheses. For
example,

a. They're active , generally , at night or on damp, cloudy days.

b. All refrigerators , whether they are defrosted manually or not ,
need to be cleaned.

c. I was a last-minute ( read interloping ) attendee at a French
journalism convention ...

We assign less error value ($-\delta_3$) to the insertion-error
hypothesis edges of nonterminals which are enclosed by commas or
parentheses.



$\delta_1$ and $\delta_2$ are weights for the error of terminal
nodes, and $\delta_3$ is a weight for the error of nonterminal nodes.

The error value e of an edge is calculated as follows. All error
values are additive. The error value e for a rule
$X \rightarrow a_1 A_1 a_2 \cdots a_i A_j$, where a is a terminal
node and A is a nonterminal node, is

1. $e = \sum e_T + \sum e_{NT}$

2. $e_T = \begin{cases} \alpha + \delta_1 - \delta_2 & \text{if terminal error} \\ 0 & \text{otherwise} \end{cases}$

3. $e_{NT} = \begin{cases} \beta - \delta_3 + e_{child} & \text{if nonterminal error} \\ e_{child} & \text{otherwise} \end{cases}$

where $\alpha \in \{\alpha_{insertion}, \alpha_{deletion}, \alpha_{mutation}\}$,
$\beta \in \{\beta_{insertion}, \beta_{deletion}\}$ and $e_{child}$
is an error value of a child edge.
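As a worked illustration of the formula, the following Python sketch computes the error value of one edge. The function names are hypothetical, and applying $\delta_1$, $\delta_2$ and $\delta_3$ only when the corresponding heuristic fires is our reading of Heuristics 2 to 4, not something the formula states explicitly. The $\delta$ weights are those reported in Section 3.2; the costs passed in the example (10.0 and 15.0) are illustrative inputs.

```python
from typing import Iterable, Optional

# Weights as reported under "Parameter adjustment" in Section 3.2.
DELTA1, DELTA2, DELTA3 = 0.01, 5.0, 1.0


def terminal_error(alpha: float, in_fiducial: bool = False,
                   often_misused: bool = False) -> float:
    """e_T for a terminal with an error hypothesis of cost alpha."""
    e = alpha
    if in_fiducial:          # Heuristics 2: inside a fiducial nonterminal
        e += DELTA1
    if often_misused:        # Heuristics 3: punctuation, conjunction, particle
        e -= DELTA2
    return e


def nonterminal_error(e_child: float, beta: Optional[float] = None,
                      enclosed: bool = False) -> float:
    """e_NT for a child edge; beta is None when no phrase error is assumed."""
    if beta is None:
        return e_child
    # Heuristics 4: cheaper when enclosed by commas or parentheses.
    return beta - (DELTA3 if enclosed else 0.0) + e_child


def edge_error(terminal_parts: Iterable[float],
               nonterminal_parts: Iterable[float]) -> float:
    """e = sum of e_T plus sum of e_NT; all error values are additive."""
    return sum(terminal_parts) + sum(nonterminal_parts)


# One misused comma (illustrative alpha = 10.0), one comma-enclosed inserted
# phrase (illustrative beta = 15.0), and one child edge carrying error 2.0.
e = edge_error(
    [terminal_error(10.0, often_misused=True)],
    [nonterminal_error(0.0, beta=15.0, enclosed=True), nonterminal_error(2.0)],
)
```

Here the three contributions are 5.0, 14.0 and 2.0, so the edge carries an error value of 21.0.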

With these heuristics, our robust parser processes the most
plausible edges first, instead of processing all generated edges at
the same time. This enhances the performance of the robust parser
and greatly reduces the number of resultant trees.
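One simple way to realise "plausible edges first" is a priority queue keyed on the error value. The sketch below uses Python's standard `heapq` module; the class name `Agenda` and the tie-breaking counter are assumptions, and the paper does not say which data structure its implementation uses.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class Agenda:
    """Edges are popped in ascending order of their error value."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._tick = itertools.count()   # breaks ties in insertion order

    def push(self, error: float, edge: Any) -> None:
        heapq.heappush(self._heap, (error, next(self._tick), edge))

    def pop(self) -> Tuple[float, Any]:
        error, _, edge = heapq.heappop(self._heap)
        return error, edge

    def __len__(self) -> int:
        return len(self._heap)


agenda = Agenda()
agenda.push(15.0, "phrase insertion")
agenda.push(0.0, "perfect match")
agenda.push(5.0, "comma mutation")
order = [agenda.pop()[1] for _ in range(len(agenda))]
```

With such an agenda the parser can stop as soon as a complete analysis is popped, so high-cost hypotheses are never expanded unless the cheaper ones fail.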

3 Implementation and Evaluation 
3.1 The robust parser 

Our robust parsing system is composed of two modules. One module is
a normal parser, which is a bottom-up chart parser. The other is a
robust parser with the error recovery mechanism proposed herein.
First, an input sentence is processed by the normal parser. If the
sentence is within the grammatical coverage of the system, the
normal parser succeeds in analyzing it. Otherwise, the normal parser
fails, and the robust parser then starts to execute with the edges
generated by the normal parser. The result of the robust parser is
the parse trees which are within the grammatical coverage of the
system. The overview of the system is shown in figure 3.




Figure 3: The overview of the system 



3.2 Experimental result 

To show usefulness of the robust parser proposed in 
this paper, we made some experiments. 







Table 1: The results of the robust parser on WSJ

Experiment 1: WSJ 410 sentences

                               with Heuristics            without Heuristics
Average sentence length        16.27 words (2-25 words)   16.27 words (2-25 words)
Average processing time        6.52 sec                   22.47 sec
Average number of edges        7726.03                    10346.6
Accuracy (%)                   77.1                       72.8
% of no-crossing sentences     23.28%                     20.28%
% of ≤ 1-crossing sentences    40.52%                     37.14%
% of ≤ 2-crossing sentences    55.17%                     48.57%



• Rule

We derived 4,958 rules and their frequencies from 14,137 sentences
in the Penn treebank tree-tagged corpus, the Wall Street Journal.
The average frequency of each rule is 48 times in the corpus.

Of these rules, we removed those which occur fewer times than the
average frequency in the corpus, and only 192 rules were left. The
removed rules are mostly for peculiar sentences, and the remaining
rules are very general. We can show that our robust parser can
compensate for the lack of rules using only 192 rules with the
recovery mechanism and heuristics.

• Test set

First, 1,000 sentences were selected randomly from the WSJ corpus,
which we had referred to in proposing the robust parser. Of these
sentences, 410 failed in normal parsing and were processed again by
the robust parser. To show the validity of the heuristics, we
compare the result of the robust parser using heuristics with that
of the parser not using heuristics. Second, to show the adaptability
of our robust parser, the same experiments were carried out on 1,000
sentences from the ATIS corpus in the Penn treebank, which we had
not referred to when proposing the robust parser. Among the 1,000
sentences from ATIS, 465 sentences were processed by the robust
parser after the failure of normal parsing.

• Parameter adjustment

We chose the best parameters of the heuristics by executing several
experiments.

$\alpha_{insertion}$ : 10.2     $\beta_{insertion}$ : 15.0
$\alpha_{deletion}$  : 10.4     $\beta_{deletion}$  : 20.0
$\alpha_{mutation}$  : 10.8
$\delta_1$ : 0.01               $\delta_2$ : 5.0
$\delta_3$ : 1.0
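The rule selection described under "Rule" above, which keeps only rules that occur at least as often as the average rule, can be sketched in Python. The function name `prune_rules` and the toy counts are hypothetical; the actual counts come from the treebank.

```python
from typing import Dict


def prune_rules(rule_counts: Dict[str, int]) -> Dict[str, int]:
    """Drop every rule that occurs fewer times than the average frequency."""
    average = sum(rule_counts.values()) / len(rule_counts)
    return {rule: n for rule, n in rule_counts.items() if n >= average}


# Hypothetical toy counts; the average frequency here is 200 / 4 = 50.
counts = {
    "NP -> dt nn": 90,
    "S -> NP VP": 59,
    "VP -> vb NP": 50,
    "VP -> vb NP comma rb comma PP": 1,
}
kept = prune_rules(counts)
```

In the toy data only the rare parenthetical rule is removed, mirroring how the peculiar-sentence rules drop out of the grammar while the general rules remain.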







Accuracy is measured as the percentage of constituents in the test
sentences which do not cross any Penn treebank constituents (Black,
1991). Table 1 shows the results of the robust parser on WSJ. In
Table 1, the 5th, 6th and 7th rows give the percentage of sentences
which have no crossing constituents, at most one crossing and at
most two crossings, respectively. With heuristics, our robust parser
reduces both the processing time and the number of edges. Also, the
accuracy improves from 72.8% to 77.1% even though the heuristics
differentiate edges and prefer some edges. This shows that the
proposed heuristics are valid in parsing real sentences. The
experiment indicates that our robust parser with heuristics can
perfectly recover about 23 out of every 100 sentences which fail in
normal parsing, as the percentage of no-crossing sentences is
23.28%.
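The crossing measure can be made concrete with a short Python sketch. It assumes each constituent is represented as a half-open span (start, end) over word positions; the function names and the toy spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]   # half-open word span (start, end)


def crosses(a: Span, b: Span) -> bool:
    """Two spans cross if they overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def crossing_count(parsed: List[Span], gold: List[Span]) -> int:
    """Number of parser constituents that cross some treebank constituent."""
    return sum(1 for c in parsed if any(crosses(c, g) for g in gold))


def non_crossing_accuracy(parsed: List[Span], gold: List[Span]) -> float:
    """Percentage of parser constituents that cross no treebank constituent."""
    return 100.0 * (len(parsed) - crossing_count(parsed, gold)) / len(parsed)


gold = [(0, 5), (0, 2), (2, 5), (3, 5)]
parsed = [(0, 5), (0, 3), (3, 5), (1, 2)]
accuracy = non_crossing_accuracy(parsed, gold)
n_cross = crossing_count(parsed, gold)
```

Only the span (0, 3) crosses a treebank constituent, namely (2, 5), so this toy sentence has one crossing and an accuracy of 75.0.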

Table 2 gives the results of the robust parser on ATIS, which we did
not refer to before. The accuracy of the result on ATIS is lower
than on WSJ because the parameters of the heuristics were adjusted
on WSJ, not on ATIS itself. However, the percentage of sentences
with at most two crossing constituents is higher than for WSJ, as
the ATIS sentences are relatively simple.

The experimental results of our robust parser 
show high accuracy in recovery even though 96% 
of total rules are removed. It is impossible to con- 
struct complete grammar rules in the real parsing 
system to succeed in analyzing every real sentence. 
So, parsing systems are likely to have extragram- 
matical sentences which cannot be analyzed by the 
systems. Our robust parser can recover these extra- 
grammatical sentences with 68 ~ 77% accuracy. 

It is very interesting that the parameters of the heuristics reflect
the characteristics of the test corpus. For example, if people tend
to write sentences with inserted phrases, then the parameter
$\alpha_{insertion}$ must increase. Therefore we can get better
results if the parameters are fitted to the characteristics of the
corpus.

Table 2: The results of the robust parser on ATIS

Experiment 2: ATIS 465 sentences

                               with Heuristics            without Heuristics
Average sentence length        10.55 words (2-25 words)   10.55 words (2-25 words)
Average processing time        8.68 sec                   71.98 sec
Average number of edges        12974.2                    25652.5
Accuracy (%)                   68.5                       59.4
% of no-crossing sentences     26.02%                     13.28%
% of ≤ 1-crossing sentences    47.10%                     36.06%
% of ≤ 2-crossing sentences    66.24%                     52.46%

4 Conclusion

In this paper, we have presented a robust parser with the extended
least-errors recognition algorithm as the recovery mechanism. This
robust parser can easily be scaled up and applied to various domains
because it depends only on syntactic factors. To enhance the
performance of the robust parser for extragrammatical sentences, we
proposed several heuristics. The heuristics assign error values to
each error-hypothesis edge, and edges which have smaller error
values are processed first. So, not all the generated edges are
processed by the robust parser, but the most plausible parse trees
can be generated first. The accuracy of the recovery in our robust
parser is about 68% ~ 77%. Hence, this parser is suitable for
systems in real application areas.

Our short term goal is to propose an automatic 
method that can learn parameter values of heuristics 
by analyzing the corpus. We expect that automat- 
ically learned values of parameters can upgrade the 
performance of the parser. 